Sampling, information extraction and summarisation of Hidden Web databases

Authors

Yih-Ling Hedley

Muhammad Younas

Anne E. James

Mark Sanderson

Abstract

Hidden Web databases maintain a collection of specialised documents, which are dynamically generated in response to users’ queries. The majority of these documents are generated through Web page templates, which contain information that is often irrelevant to queries. In this paper, we present a system designed to detect and extract query-related information from documents sampled from databases. The proposed system, 2PS, is based on a two-phase framework for the sampling, extraction and summarisation of Hidden Web documents. In the first phase, 2PS queries databases with random terms selected from those contained in their search interface pages and in the subsequently retrieved documents, until a pre-determined number of documents has been sampled. In the second phase, it detects Web page templates in the sampled documents in order to extract the information relevant to the respective queries, from which a content summary is generated. 2PS is validated through the implementation of a prototype system. Its evaluation is performed through experiments on a number of real-world Hidden Web databases. The experimental results demonstrate that 2PS effectively eliminates irrelevant information contained in Web page templates and generates terms and frequencies with improved accuracy.
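The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy in-memory database, the names `toy_search`, `sample_documents`, `strip_template` and `summarise`, and the line-frequency heuristic used to detect template content are all assumptions made for illustration. The abstract does not specify how 2PS detects templates, so a simple stand-in is used here.

```python
import random
import re
from collections import Counter

# Hypothetical toy database: every result page wraps a body in the same template.
HEADER = "Home | Search | Contact"
FOOTER = "Copyright Example Library"
BODIES = {
    1: "protein folding kinetics",
    2: "protein binding assay",
    3: "gene expression kinetics",
}

def toy_search(term):
    """Stand-in for a Hidden Web search interface: match the term against bodies only."""
    return [(doc_id, f"{HEADER}\n{body}\n{FOOTER}")
            for doc_id, body in BODIES.items() if term in body.split()]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def sample_documents(search, seed_terms, target, rng, max_queries=1000):
    """Phase 1: query with random terms taken from the interface page and from
    the documents retrieved so far, until `target` documents are sampled."""
    vocabulary = sorted(set(seed_terms))
    sampled = {}
    for _ in range(max_queries):
        if len(sampled) >= target or not vocabulary:
            break
        term = rng.choice(vocabulary)
        for doc_id, text in search(term):
            if doc_id not in sampled and len(sampled) < target:
                sampled[doc_id] = text
                vocabulary = sorted(set(vocabulary) | set(tokenize(text)))
    return sampled

def strip_template(documents, threshold=0.8):
    """Phase 2a (assumed heuristic): treat a line as template content if it
    occurs in at least `threshold` of the sampled documents, and drop it."""
    line_counts = Counter()
    for text in documents.values():
        line_counts.update(set(text.splitlines()))
    cutoff = threshold * len(documents)
    return {doc_id: "\n".join(line for line in text.splitlines()
                              if line_counts[line] < cutoff)
            for doc_id, text in documents.items()}

def summarise(documents):
    """Phase 2b: content summary mapping each term to its document frequency."""
    frequencies = Counter()
    for text in documents.values():
        frequencies.update(set(tokenize(text)))
    return frequencies

rng = random.Random(0)
sampled = sample_documents(toy_search, ["protein"], target=3, rng=rng)
raw_summary = summarise(sampled)
summary = summarise(strip_template(sampled))
```

In this sketch, template words such as "home" and "copyright" appear in every sampled page and therefore inflate `raw_summary`. They are absent from `summary`, which keeps only terms from the query-related bodies.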

Similar resources

CCReSD: concept-based categorisation of Hidden Web databases

Hidden Web databases dynamically generate results in response to users’ queries. The categorisation of such databases into a category scheme has been widely employed in information searches. We present a Concept-based Categorisation over Refined Sampled Documents (CCReSD) approach that effectively handles information extraction, summarisation and categorisation of such databases. CCReSD detects...

Information Extraction from Template-Generated Hidden Web Documents

The larger amount of information on the Web is stored in document databases and is not indexed by general-purpose search engines (such as Google and Yahoo). Databases that dynamically generate a list of documents in response to a user query are referred to as Hidden Web databases. Such documents are typically presented to users as template-generated Web pages. This paper presents a new approa...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Aggregates Disclosure in Hidden Web Databases: an Urgent Challenge

Hidden web databases are widely prevalent on the Internet. Security issues specific to hidden databases, however, have been largely overlooked by the research community, possibly due to the (false) sense of security provided by the restrictive access (i.e., web interface) to such databases. We argue that an urgent challenge facing today’s hidden databases is the disclosure of sensitive aggregat...

Comparison of Bibliographic Databases in Retrieving Information on Telemedicine

Background & Aims: Some of the main questions which can be of importance for those researchers who intend to perform a systematic review in a field of science are: ‘What databases should I use for my review?’; ‘Do all these databases have the same value?’; and ‘Which sources retrieved the highest of relevant references?’. The main aim of this work was the identification of the best database for ...

Journal:

Data Knowl. Eng.

Volume 59

Publication date: 2006
